-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Kraken2] Split database from Kraken2 TheiaCoV task #608
Conversation
…base for theiacov wfs
…sequencing platform
…om/theiagen/public_health_bioinformatics into jro-kraken-split-database-and-task
…ken DB used for theiacov workflows
… se test workflows
I think this is looking pretty okay. It's a quite sizable PR as a lot of issues started to get pooled into the same PR. In short, the main changes are:
Additional tests that could be done that I personally didn't check but James did:
|
@@ -227,8 +230,8 @@ workflow theiacov_illumina_pe { | |||
num_reads_raw2 = read_QC_trim.fastq_scan_raw2, | |||
num_reads_clean1 = read_QC_trim.fastq_scan_clean1, | |||
num_reads_clean2 = read_QC_trim.fastq_scan_clean2, | |||
kraken_human = read_QC_trim.kraken_human, | |||
kraken_human_dehosted = read_QC_trim.kraken_human_dehosted, | |||
kraken_human = read_QC_trim.kraken2_human, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you update the qc_check_task to use kraken2
as well for consistency? probably going to have to make this change in the export_taxon_table task as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kraken2 is kraken2 everywhere now
@@ -37,6 +37,8 @@ workflow theiacov_clearlabs { | |||
String? target_organism |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be kraken2_target_organism? i feel like we discussed this at one point in time and cannot remember the conclusion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a conflict with the output declaration in this particular case so I'm unsure how to proceed here. I0m leaving it as target_organism
until we can discuss it further
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on theiacov_ont is also target_organism
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would there be an instance we'd want this input to feed into multiple tasks, to ensure consistency? My gut reaction to having workflow level inputs be specific to tasks, is we should only do that if we could/would not use it in multiple workflows. This, to me, seems like an instance we might want to leave it without any tool name to allow for just that. Maybe worth some discussion? Ultimately, when we get to the PHB-lite effort, this will get abstracted away, but something to consider in the interim.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure that's what we're using the organism_parameter workflow to do in order to standardize the language here. This input is piped into organism_parameter
and that output is fed into all downstream tasks
Could you help me understand what you mean by having workflow level inputs specific to tasks if they're not used in multiple workflows?
If I'm understanding correctly, then we might be better served by removing all parameters that can be modified in the organism_parameters workflow so that they're not workflow-level inputs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks great -- one question: if our aim is to standardize the use of kraken2 vs kraken, we should probably use kraken2 everywhere and not pick and choose.
Please go through and standardize all the kraken
instances to kraken2
in both docs and all workflows & tasks.
Testing TheiaProk PE here, TheiaMeta here and Kraken2_PE here |
Other impacted workflows that require testing:
|
Okay, I think everything is ready for retesting!!
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cimendes I know this will be a pain to go back and refactor, but we need to avoid changing workflow-level output namespaces for modifications such as "harmonizing" the use of kraken/kraken2. i.e. have what is output to the Terra table remain the same, but we can change what defines that variable.
For example
Float kraken2_human_dehosted = kraken2.kraken2_percent_human |
Float kraken_human_dehosted = kraken2.percent_human
# should become
Float kraken_human_dehosted = kraken2.kraken2_percent_human
# should NOT become
Float kraken2_human_dehosted = kraken2.kraken2_percent_human
Changing all instances of kraken
, including the output namespaces, introduces a backwards incompatible change that would pose some pain points to the end user. I know some of this may work may have begun with James, but let's ditch those, where possible and only make those changes "under the hood" so our users are unimpacted. Happy to discuss more, if you like.
This PR closes #563, closes #520 and closes #402
🗑️ This dev branch should be deleted after merging to main.
🧠 Summary
This PR addresses several issues that came up when the RSV target organism was set differently for RSV-A and RSV-B, i.e.
rsv_a_kraken_target_organism = "Respiratory syncytial virus"
andrsv_b_kraken_target_organism = "Human orthopneumovirus"
due to the use of an old database. As the existing docker image had an embedded database, it meant that each time the Kraken database would need to be updated, it would have to be done inside the docker container, which would not be efficient. The TheiaProk workflows were already using a standalone database and a docker image free of a database, allowing for flexibility in user input for a preferred database.We have therefore implemented the same strategy to the TheiaCoV workflows, using the standalone Kraken2 task, and a default custom built Kraken2 database comprising Human and Viral sequences.
Further, for consistency, we have made all Kraken 2 outputs to be prefixed by "Kraken2", which previously had a mix of
Kraken2_
andKraken_
⚡ Impacted Workflows/Tasks
This PR may lead to different results in pre-existing outputs: Yes
Use of an updated database may reflect changes in how taxa are identified.
This PR uses an element that could cause duplicate runs to have different results: No
🛠️ Changes
⚙️ Algorithm
We have removed the
kraken2_theiacov
task and instead using thekraken2_standalone
task for the TheiaCoV workflows➡️ Inputs
The TheiaCoV workflows, Illumina PE, Illumina SE, and ONT, have a default Kraken2 database that includes Human and Viral sequences from NCBI
File kraken_db = "gs://theiagen-large-public-files-rp/terra/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz"
⬅️ Outputs
All Kraken2 outputs are now prefixed with "Kraken2_"
Float? kraken2_human
Float? kraken2_human_dehosted
Float? kraken2_sc2
Float? kraken2_sc2_dehosted
String? kraken2_target_organism
String? kraken2_target_organism_dehosted
String? kraken2_target_organism_name
String? kraken2_version
File? kraken2_report
File? kraken2_report_dehosted
String? kraken2_docker
String? kraken2_database
🧪 Testing
The database was tested locally with the standalone task and the custom database and worked as expected.
Further testing was successfully done on Terra as follows:
Bacterial pathogens and TheiaProk workflows: call_kraken=true and kraken_db=gs://theiagen-large-public-files-rp/terra/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz
TheiaProk_Illumina_PE_PHB: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/10eddd67-5216-4529-b355-8a14947de621
TheiaProk_ONT_PHB: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/7206c5d6-fd58-448c-becf-1abb4ed27c5c
Viral Pathogens and TheiaCoV workflows
TheiaCoV_Illumina_PE_PHB: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/7212fba6-b533-4afc-882f-a0aebc5a9157
TheiaCoV_Illumina_SE_PHB: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/03a0096f-7636-47fe-be06-3446596f7daf
TheiaCoV_ONT_PHB: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/11f05bb7-022c-4d44-914b-992614202e46
RSV and TheiaCoV_Illumina_PE_PHB
RSV-A: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/8f315934-0771-40d9-914a-19abc7a9413d
best target for identification is "Human respiratory syncytial virus A"
RSV-B: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/73211a8e-21cb-4549-b424-01a7d636877a
best target for identification is "human respiratory syncytial virus"
RSV and TheiaCoV_Illumina_PE_PHB with updated Kraken target organism identifier
RSV-A: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/3546ab3d-c2cd-4c1e-ae3a-da78754733e7
RSV-B: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/ce3d507f-05df-403f-812e-7f781f761364
Bacterial and Viral Pathogens and Kraken2 standalone workflows
Kraken2_PE_PHB: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/ac5dc25b-a305-4ca2-be52-3845eb683d4e
Kraken2_SE_PHB: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/fa7e5754-85de-4b34-929c-c11da23ab3d5
Kraken2_ONT_PHB: https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_Otieno_Sandbox/job_history/6845c8eb-ee26-402f-ac11-550591f7a7ac
Suggested Scenarios for Reviewer to Test
🔬 Final Developer Checklist
🎯 Reviewer Checklist